From Surveys to Populations

GVPT399F: Power, Politics, and Data

Surveys

  • Populations are very difficult to collect data on

    • Even the census misses people!
  • Happily, we can use surveys of a sample of our population to learn things about the population as a whole

  • However, our ability to do this is conditional on how good our sample is

  • What do I mean by “good”?

The 2024 US Presidential Election

  • Elections are preceded by a flood of surveys

Surveys

  • Surveys are conducted on a subset (sample) of the population of interest

  • Our population of interest: individuals who voted in the 2024 US Presidential Election

A good sample

  • A good sample is a representative one

  • How closely does our sample reflect our population?

Parallel worlds

  • Remember back to last session on experiments

  • In an ideal world, we would be able to create two parallel worlds (one with the treatment, one held as our control)

    • One version of the election booth run without monitors (the control)

    • One version with monitors (the treatment)

  • These worlds are perfectly identical to each other prior to treatment

  • We cannot do this :(

The next best thing

  • Our next best option is to create two groups that are as identical to one another as possible prior to treatment

  • If they are (almost) identical, differences between their group-wide outcomes can be attributed to the treatment

  • One good way of getting two (almost) identical groups is to assign individuals to those groups randomly

    • Think back to our 1,000 hypothetical people!
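A minimal sketch of this idea (the 1,000 people and their ages are hypothetical, not from the slides): randomly assign everyone to treatment or control, then check that a pre-treatment characteristic is balanced across the two groups.

```r
set.seed(111)

# 1,000 hypothetical people with a pre-treatment characteristic (age)
people <- data.frame(
  id  = 1:1000,
  age = sample(18:80, size = 1000, replace = TRUE)
)

# Shuffle the group labels: each person has an equal chance
# of landing in either group
people$group <- sample(rep(c("control", "treatment"), each = 500))

# Pre-treatment averages should be (almost) identical
tapply(people$age, people$group, mean)
```

Because assignment ignores age entirely, any remaining difference in average age between the groups is just chance, and it shrinks as the groups get larger.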

Randomization

  • Randomization continues to pop its chaotic head up

  • We can use it to create a sample that is (almost) identical to our population, on average

  • Drawing randomly from our population increases our chances of ending up with a sample that reflects that population

  • This would be referred to as a representative sample

Random sampling

  • All individuals in the population need to have an equal chance of being selected for the sample

    • If this holds, you have a pure random sample
  • This is really hard to do!

    • How likely were you to answer the pollster’s unknown number, calling you in the middle of the day?

    • Even if you did answer, how likely were you to answer all their questions?
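The "equal chance of selection" condition can be checked by simulation (a sketch with a made-up population of 100 people, not from the slides): draw many random samples and count how often each person is selected.

```r
set.seed(222)

population <- 1:100  # a hypothetical population of 100 people

# Draw 10,000 random samples of size 10 and record who was selected
selections <- replicate(10000, sample(population, size = 10))

# In a pure random sample, each person should be selected
# about 10% of the time (10 out of 100 slots per draw)
selection_rate <- table(selections) / 10000
range(selection_rate)
```

Every selection rate lands close to 0.10, which is exactly what "equal chance of being selected" means in practice. Non-response breaks this: people who screen unknown numbers effectively have a lower selection probability.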

Large numbers

  • Randomization isn’t enough: we also need to draw a sufficiently large sample from our population

    • One person pulled randomly from the class isn’t going to be very reflective of the class!
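To see why one person isn't enough, here is a sketch with a hypothetical class of 200 students and made-up test scores: the average from a single random draw bounces around wildly, while larger samples hug the class average much more tightly.

```r
set.seed(333)

# Hypothetical class of 200 students with test scores
class_scores <- rnorm(200, mean = 70, sd = 10)

# Average score in a random sample of n students
sample_avg <- function(n) mean(sample(class_scores, size = n))

# How much does the sample average vary across 1,000 repeated draws?
sd(replicate(1000, sample_avg(1)))   # one student: very noisy
sd(replicate(1000, sample_avg(50)))  # fifty students: far more stable
```

The spread of the one-person "average" is roughly the spread of the scores themselves; with 50 students it shrinks dramatically. Randomization gets you representativeness on average, but sample size is what makes any single sample reliable.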

To illustrate

Countries’ GDP in 2022: [plot omitted]

Countries’ GDP

I want to estimate the average GDP across all countries in 2022.

  • I send out a survey to all countries’ Departments of Statistics and ask for their GDP figures for 2022.

  • I get 60 responses:

sample_df <- gdp_df |> 
  drop_na(sample_value) |>                 # keep only countries that responded
  sample_n(size = 60) |>                   # draw the 60 responses
  transmute(country, gdp = sample_value)   # keep country name and GDP

sample_df
# A tibble: 60 × 2
   country                  gdp
   <chr>                  <dbl>
 1 Portugal             2.55e11
 2 Bolivia              4.40e10
 3 Peru                 2.46e11
 4 Japan                4.26e12
 5 Kuwait               1.83e11
 6 Viet Nam             4.10e11
 7 Ecuador              1.17e11
 8 Hong Kong SAR, China 3.59e11
 9 Angola               1.04e11
10 Russian Federation   2.27e12
# ℹ 50 more rows

Countries’ GDP

I now calculate the average of these responses, which I find to be:

sample_df |> 
  summarise(avg_gdp = scales::dollar(mean(gdp, na.rm = TRUE)))
# A tibble: 1 × 1
  avg_gdp         
  <chr>           
1 $894,311,425,331

Now, imagine that we knew the true average across all countries definitively. Why such a large difference from our sample's average?

Non-response bias

Poorer countries are far less likely to be able or willing to provide these economic data to academics or international organizations.

  • They tend to be underrepresented in a lot of data

My sample was biased against poorer countries.

  • They were less likely than rich countries to respond to my request for data
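The mechanism can be sketched with simulated data (the GDP figures and response probabilities below are invented, not the real survey): if richer countries are more likely to respond, the respondents' average overstates the true average.

```r
set.seed(444)

# 1,000 hypothetical countries with right-skewed GDPs,
# loosely mimicking the shape of real GDP data
gdp <- rlnorm(1000, meanlog = 24, sdlog = 2)

# Richer countries respond more often: response probability
# increases with (standardized log) GDP
respond_prob <- plogis(as.numeric(scale(log(gdp))))
responded <- runif(1000) < respond_prob

mean(gdp)             # true population average
mean(gdp[responded])  # average among respondents only
```

The respondents' average comes out well above the population average, even though every country that responded answered truthfully. The bias comes entirely from who responds, not from what they report.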